22 research outputs found

    Comparability, evaluation and benchmarking of large pre-trained language models

    Pre-trained language models evaluating themselves - A comparative study

    Evaluating generated text has received new attention with the introduction of model-based metrics in recent years. These new metrics correlate more highly with human judgments and seemingly overcome many issues of previous n-gram-based metrics from the symbolic age. In this work, we examine the recently introduced metrics BERTScore, BLEURT, NUBIA, MoverScore, and Mark-Evaluate (Petersen). We investigate their sensitivity to different types of semantic deterioration (part-of-speech drop and negation), word-order perturbations, word drop, and the common problem of repetition. No metric showed appropriate behaviour for negation, and none of them was consistently sensitive to the other issues mentioned above.
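
    As a minimal illustration of the kind of perturbation probe described here (an invented example, not the paper's actual test suite), one can check BERTScore's response to negation with the bert-score package:

```python
# Minimal sketch: probing BERTScore's sensitivity to negation.
# Assumes the bert-score package (pip install bert-score); the example
# sentences are invented, not taken from the paper's test data.
from bert_score import score

refs = ["The treatment improved patient outcomes."] * 2
cands = [
    "The treatment improved patient outcomes.",         # identical copy
    "The treatment did not improve patient outcomes.",  # negated
]

P, R, F1 = score(cands, refs, lang="en", verbose=False)
for cand, f1 in zip(cands, F1.tolist()):
    print(f"F1={f1:.3f}  {cand}")
# A negation-sensitive metric would score the negated candidate markedly
# lower; the paper reports that none of the examined metrics does.
```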

    CC-Top: Constrained Clustering for Dynamic Topic Discovery

    Research on multi-class text classification of short texts mainly focuses on supervised (transfer) learning approaches, which require a finite set of pre-defined classes that is constant over time. This work explores deep constrained clustering (CC) as an alternative to supervised learning approaches in a setting with a dynamically changing number of classes, a task we introduce as dynamic topic discovery (DTD). We do so by using pairwise similarity constraints instead of instance-level class labels, which allows for a flexible number of classes while exhibiting competitive performance compared to supervised approaches. First, we substantiate this through a series of experiments and show that CC algorithms exhibit predictive performance similar to state-of-the-art supervised learning algorithms while requiring less annotation effort. Second, we demonstrate the overclustering capabilities of deep CC for detecting topics in short-text data sets in the absence of the ground-truth class cardinality during model training. Third, we showcase how these capabilities can be leveraged for the DTD setting as a step towards dynamic learning over time. Finally, we release our codebase to nurture further research in this area.
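
    A minimal sketch of the pairwise-constraint idea (a common formulation of constrained clustering losses, not necessarily the exact loss used in this work): the cluster-assignment probabilities of the two texts in a pair are combined into a same-cluster probability, which a binary must-link/cannot-link label then supervises.

```python
# Sketch of a pairwise-constraint clustering loss; a common formulation,
# not necessarily the exact loss used in the paper.
import torch
import torch.nn.functional as F

def pairwise_constraint_loss(logits_a: torch.Tensor,
                             logits_b: torch.Tensor,
                             must_link: torch.Tensor) -> torch.Tensor:
    """logits_*: (batch, n_clusters) cluster logits for the two texts of
    each pair; must_link: (batch,) float, 1 = same topic, 0 = different."""
    p_a = F.softmax(logits_a, dim=1)
    p_b = F.softmax(logits_b, dim=1)
    # Probability that both texts fall into the same cluster.
    p_same = (p_a * p_b).sum(dim=1).clamp(1e-6, 1 - 1e-6)
    return F.binary_cross_entropy(p_same, must_link)

# Toy usage: 3 pairs scored against 10 candidate clusters (overclustering
# leaves room for topics unseen during training).
logits_a, logits_b = torch.randn(3, 10), torch.randn(3, 10)
must_link = torch.tensor([1.0, 0.0, 1.0])
print(pairwise_constraint_loss(logits_a, logits_b, must_link))
```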

    Exposure-lag-response associations between lung cancer mortality and radon exposure in German uranium miners.

    Exposure-lag-response associations shed light on the duration of pathogenesis for radiation-induced diseases. To investigate such relations for lung cancer mortality in the German uranium miners of the Wismut company, we apply distributed lag non-linear models (DLNMs), which offer a flexible description of the lagged risk response to protracted radon exposure. Exposure-lag functions are implemented with B-splines in Cox proportional hazards models. The DLNM approach yielded good agreement of exposure-lag-response surfaces between the German cohort and the previously studied cohort of American Colorado miners. For both cohorts, a minimum lag of about 2 years for the onset of risk after first exposure explained the data well, albeit with possibly large uncertainty. Risk estimates from DLNMs were directly compared with estimates from both standard radio-epidemiological models and biologically based mechanistic models. For ages above 45 years, all models predict decreasing estimates of the Excess Relative Risk (ERR). However, at younger ages, marked differences appear, as DLNMs exhibit ERR peaks that are not detected by the other models. After comparing exposure responses for biological processes in mechanistic risk models with exposure responses for hazard ratios in DLNMs, we propose a typical period of 15 years for radon-related lung carcinogenesis, covering the onset of radiation-induced inflammation of lung tissue until cancer death. The DLNM framework provides a view on age-risk patterns that supplements the standard radio-epidemiological approach and biologically based modeling.
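
    DLNMs of this kind are typically fit with R tooling; purely as an illustration of the cross-basis construction, the following Python sketch uses simulated data, a polynomial lag basis in place of the paper's B-splines, and lifelines for the Cox regression:

```python
# Illustrative sketch of a distributed-lag cross-basis in a Cox model.
# Simplifications: simulated data and a polynomial lag basis instead of
# the B-splines used in the study; lifelines stands in for the original
# survival machinery.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n, max_lag = 500, 20
exposure = rng.gamma(2.0, 1.0, size=(n, max_lag))  # yearly exposure history

# Lag basis 1, l, l^2 over lags, zeroed below the ~2-year minimum lag.
lags = np.arange(max_lag)
basis = np.stack([np.ones_like(lags), lags, lags**2], axis=1).astype(float)
basis[lags < 2] = 0.0

cross_basis = exposure @ basis  # (n, 3): lag-weighted exposure summaries

df = pd.DataFrame(cross_basis, columns=["cb0", "cb1", "cb2"])
df["duration"] = rng.exponential(30.0, size=n)  # simulated follow-up (years)
df["event"] = rng.integers(0, 2, size=n)        # simulated death indicator

cph = CoxPHFitter()
cph.fit(df, duration_col="duration", event_col="event")
cph.print_summary()  # coefficients of cb0..cb2 trace the lag-response shape
```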

    ActiveGLAE: A Benchmark for Deep Active Learning with Transformers

    Deep active learning (DAL) seeks to reduce annotation costs by enabling the model to actively query instance annotations from which it expects to learn the most. Despite extensive research, there is currently no standardized evaluation protocol for transformer-based language models in the field of DAL. Diverse experimental settings lead to difficulties in comparing research and deriving recommendations for practitioners. To tackle this challenge, we propose the ActiveGLAE benchmark, a comprehensive collection of data sets and evaluation guidelines for assessing DAL. Our benchmark aims to facilitate and streamline the evaluation process of novel DAL strategies. Additionally, we provide an extensive overview of current practice in DAL with transformer-based language models. We identify three key challenges - data set selection, model training, and DAL settings - that pose difficulties in comparing query strategies. We establish baseline results through an extensive set of experiments as a reference point for evaluating future work. Based on our findings, we provide guidelines for researchers and practitioners.
    Comment: Accepted @ ECML PKDD 2023. This is the author's version of the work. The definitive Version of Record will be published in the Proceedings of ECML PKDD 2023.
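
    As a schematic of the pool-based loop that such benchmarks evaluate (a generic sketch, not ActiveGLAE's actual protocol; a logistic-regression classifier on synthetic data stands in for the transformer):

```python
# Generic pool-based active learning loop with least-confidence sampling.
# A logistic-regression classifier on synthetic data stands in for the
# transformer-based language model; the benchmark's actual data sets,
# model training, and DAL settings differ.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
labeled = list(range(20))  # small seed set of labeled instances
pool = [i for i in range(len(X)) if i not in labeled]

for step in range(5):
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    probs = clf.predict_proba(X[pool])
    # Least-confidence query strategy: lowest maximum class probability.
    query = np.argsort(probs.max(axis=1))[:10]
    for i in sorted(query.tolist(), reverse=True):
        labeled.append(pool.pop(i))  # "annotate" the queried instances
    print(f"step {step}: {len(labeled)} labels, acc={clf.score(X, y):.3f}")
```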

    Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization

    Large Language Models (LLMs) have reshaped natural language processing with their impressive capabilities. Their ever-increasing size, however, has raised concerns about their effective deployment and the need for LLM compression. This study introduces the Divergent Token Metrics (DTMs), a novel approach for assessing compressed LLMs that addresses the limitations of traditional perplexity or accuracy measures, which fail to accurately reflect text generation quality. DTMs focus on token divergences, which allow deeper insights into the subtleties of model compression, in particular when evaluating the impact of individual components. Utilizing the First Divergent Token Metric (FDTM) in model sparsification reveals that a quarter of all attention components can be pruned beyond 90% on the Llama-2 model family while still keeping SOTA performance. For quantization, the FDTM suggests that over 80% of parameters can naively be transformed to int8 without special outlier management. These evaluations indicate the necessity of choosing appropriate compressions for parameters individually (and that the FDTM can identify those), while standard metrics result in deteriorated outcomes.
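
    The core idea behind the FDTM can be sketched as follows (a reading of the abstract, not the authors' released implementation): greedily decode with both the base and the compressed model and record the position of the first token at which their outputs part ways.

```python
# Sketch of the First Divergent Token idea: the position where a
# compressed model's greedy continuation first departs from the base
# model's. A reading of the abstract, not the authors' implementation.
from typing import Sequence

def first_divergent_token(base_tokens: Sequence[int],
                          compressed_tokens: Sequence[int]) -> int:
    """Index of the first position where the two decodes differ; returns
    the shorter length if one sequence is a prefix of the other."""
    for i, (a, b) in enumerate(zip(base_tokens, compressed_tokens)):
        if a != b:
            return i
    return min(len(base_tokens), len(compressed_tokens))

# Toy usage: a later divergence indicates milder degradation.
print(first_divergent_token([5, 17, 3, 42, 8], [5, 17, 3, 9, 8]))  # -> 3
```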

    How Different Is Stereotypical Bias Across Languages?

    Recent studies have demonstrated how to assess the stereotypical bias in pre-trained English language models. In this work, we extend this branch of research in multiple different dimensions by systematically investigating (a) mono- and multilingual models of (b) different underlying architectures with respect to their bias in (c) multiple different languages. To that end, we make use of the English StereoSet data set (Nadeem et al., 2021), which we semi-automatically translate into German, French, Spanish, and Turkish. We find that it is of major importance to conduct this type of analysis in a multilingual setting, as our experiments show a much more nuanced picture as well as notable differences from the English-only analysis. The main takeaways from our analysis are that mGPT-2 (partly) shows surprising anti-stereotypical behavior across languages, English (monolingual) models exhibit the strongest bias, and the stereotypes reflected in the data set are least present in Turkish models. Finally, we release our codebase alongside the translated data sets and practical guidelines for the semi-automatic translation to encourage a further extension of our work to other languages.
    Comment: Accepted @ the 3rd Workshop on Bias and Fairness in AI (co-located with ECML PKDD 2023). This is the author's version of the work. The definitive Version of Record will be published in the proceedings.
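
    A minimal sketch of the likelihood-based scoring behind StereoSet-style probes (invented sentences and a small public model, not the paper's actual scorer or models):

```python
# Sketch of likelihood-based stereotype scoring with a causal LM; the
# general idea behind StereoSet-style probes, not the paper's scorer.
# Sentences are invented; "gpt2" is merely a small public model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sentence_logprob(text: str) -> float:
    """Total log-probability of a sentence under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean next-token NLL
    return -loss.item() * (ids.shape[1] - 1)

stereo = "The engineer fixed the bug because he was meticulous."
anti = "The engineer fixed the bug because she was meticulous."
# A model preferring the stereotypical variant assigns it a higher score.
print(sentence_logprob(stereo) > sentence_logprob(anti))
```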